Embedding And Clustering Your Data Can Improve Contrastive Pretraining

Merrick, Luke

arXiv.org Artificial Intelligence

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
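The abstract's core recipe — embed each training example, run k-means on the embeddings, then build minibatches whose examples all come from one cluster — can be sketched in a few lines. This is only an illustration of the general idea, not the paper's implementation; the function names (`kmeans`, `cluster_stratified_batches`) and the plain Lloyd's-algorithm k-means are assumptions made for the sketch.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain Lloyd's algorithm: assign each point to its nearest centroid,
    # then recompute centroids as cluster means. X is (n_examples, dim).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared Euclidean distances from every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return labels

def cluster_stratified_batches(embeddings, batch_size, k, seed=0):
    # Group examples by the k-means cluster of their embedding, then emit
    # minibatches drawn from a single cluster at a time -- the single-source
    # batching idea pushed down to sub-source (semantic cluster) granularity.
    labels = kmeans(embeddings, k, seed=seed)
    rng = np.random.default_rng(seed)
    batches = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batches.append(idx[start:start + batch_size])
    # Shuffle the batch order so training does not walk clusters sequentially.
    order = rng.permutation(len(batches))
    return [batches[i] for i in order]
```

Within each such batch, in-batch negatives come from the same semantic cluster, which is what makes them harder than negatives drawn from a mixed pool.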


Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

Belkin, Mikhail, Niyogi, Partha

Neural Information Processing Systems

Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering. In many areas of artificial intelligence, information retrieval and data mining, one is often confronted with intrinsically low dimensional data lying in a very high dimensional space. For example, grayscale n x n images of a fixed object taken with a moving camera yield data points in R^(n^2). However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera - in fact the space has the natural structure of a manifold embedded in R^(n^2).
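The Laplacian eigenmaps algorithm described above can be sketched compactly: build a nearest-neighbor graph over the data, form the graph Laplacian L = D - W, and embed each point using the eigenvectors of L with the smallest nonzero eigenvalues. The sketch below is a minimal numpy-only illustration with simple 0/1 edge weights (the paper also discusses heat-kernel weights); the function name and parameter choices are assumptions made for the example.

```python
import numpy as np

def laplacian_eigenmap(X, n_neighbors=5, dim=2):
    # X is (n_points, ambient_dim); returns an (n_points, dim) embedding.
    n = len(X)
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Symmetric k-nearest-neighbor adjacency with 0/1 weights.
    W = np.zeros((n, n))
    for i in range(n):
        # argsort puts the point itself first (distance 0); skip it.
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:
            W[i, j] = W[j, i] = 1.0
    # Unnormalized graph Laplacian: degree matrix minus adjacency.
    L = np.diag(W.sum(1)) - W
    # eigh returns eigenvalues in ascending order; the first eigenvector is
    # the trivial constant one, so the embedding uses the next `dim` of them.
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]
```

Because the eigenvectors vary slowly over edges of the neighborhood graph, nearby points in the original space land near each other in the embedding, which is the locality-preserving property the abstract refers to.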